The banking industry heavily relies on the profitability of home loans, which are primarily sought after by individuals with regular or substantial incomes. However, the specter of loan defaults presents a significant financial risk, potentially negating the profits from these loans. Historically, banks have navigated the loan approval process through meticulous manual reviews, a method that, while thorough, is prone to inefficiencies, human error, and biases. With the evolution of technology, there's been a shift towards automating this process to enhance efficiency and objectivity. The advent of data science and machine learning offers a promising avenue for developing sophisticated models that can predict loan default risks more accurately, thereby making the loan approval process more streamlined, unbiased, and effective.
The primary aim is to revolutionize the loan approval process by implementing a classification model that predicts the likelihood of loan defaults. This initiative, rooted in the principles of the Equal Credit Opportunity Act, seeks to harness recent loan application data and insights from the bank's loan underwriting process to construct a model grounded in empirical data and statistical validity. The model is expected not only to enhance predictive accuracy but also to ensure transparency and fairness, particularly in rejection cases. By identifying key predictive factors, the model will provide actionable insights, enabling the bank to make more informed decisions, optimize the approval process, and ultimately minimize the risk of defaults.
What are the primary factors contributing to loan defaults?
Identifying the most influential variables that predict default can help tailor the model to focus on the most relevant data points. To answer this, we will perform exploratory data analysis (EDA) and model the relationships between the features and the target.
How can the model incorporate the guidelines of the Equal Credit Opportunity Act to ensure fairness and avoid bias?
To ensure the predictive model for loan approvals aligns with the Equal Credit Opportunity Act and utilizes the available data effectively, it's essential to focus on non-discriminatory, financially relevant features like income, debt levels, payment history, and assets, while excluding or carefully scrutinizing variables that could indirectly relate to protected characteristics. Incorporating bias detection and mitigation strategies throughout the model's development and application phases is crucial, employing statistical analysis to identify biases and applying algorithms to reduce them. Emphasizing model interpretability and transparency allows for the provision of clear, understandable justifications for credit decisions, meeting ECOA requirements. Regular validation, continuous monitoring for fairness, and legal compliance reviews ensure the model remains unbiased and effective over time. Feedback mechanisms further refine the model, ensuring it reflects equitable credit decision practices while leveraging data points like "DEBTINC," "CLAGE," "DELINQ," and "DEROG" to assess creditworthiness comprehensively and fairly.
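One of the bias-detection checks described above can be sketched as a simple approval-rate comparison across groups. The HMEQ data contains no protected characteristics, so the `group` column below is synthetic and purely illustrative of the technique (the "four-fifths rule" ratio often used as a first-pass disparate-impact screen):

```python
# Hedged sketch of a basic fairness screen: comparing approval rates across
# groups of a hypothetical attribute. `group` is synthetic; the real check
# would use whatever audit attribute compliance reviewers supply.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=1000),
    "approved": rng.choice([0, 1], size=1000, p=[0.3, 0.7]),
})

# Approval rate per group, and the min/max ratio (four-fifths rule screen)
rates = df.groupby("group")["approved"].mean()
disparate_impact = rates.min() / rates.max()
print(rates)
print(f"Disparate impact ratio: {disparate_impact:.2f}")
```

A ratio well below 0.8 would flag the model for a closer fairness review; this screen is a starting point, not a substitute for a full compliance analysis.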
What predictive modeling techniques will be most effective and interpretable for this application?
Determining the balance between model complexity, accuracy, and interpretability to ensure that decisions can be explained and justified.
Logistic Regression
Pros:
Interpretability: Coefficients map directly to the effect of each feature on the odds of default, making decisions easy to explain and justify.
Efficiency: Fast to train and well suited as a baseline for binary classification.
Cons:
Linearity Assumption: Models only linear relationships between the features and the log-odds, so complex interactions may be missed.
Sensitivity to Outliers and Scaling: Performance can degrade without careful preprocessing of skewed or extreme values.
Decision Trees
Pros:
Transparent Decision Process: The hierarchical structure of decisions based on feature values is easy to visualize and understand.
Handles Non-linearity: Can model complex relationships without needing the data to be linearly separable.
Cons:
Overfitting: Tends to overfit the training data, making the model less generalizable to unseen data.
Instability: Small changes in the data can result in significantly different trees.
Random Forest
Pros:
Robustness: Averaging many trees reduces overfitting and yields more stable predictions than a single decision tree.
Accuracy: Often among the strongest off-the-shelf classifiers on tabular data, and provides feature importance estimates.
Cons:
Reduced Interpretability: An ensemble of hundreds of trees cannot be visualized or explained as directly as a single tree or a logistic model.
Computational Cost: Training and tuning many trees is slower and more memory-intensive.
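The three candidate models can be compared in a quick baseline experiment before committing to one. The sketch below uses synthetic data with a roughly 20% positive class to mimic BAD; hyperparameters are illustrative, not the final tuned values:

```python
# Hedged sketch: baseline comparison of the three candidate classifiers on
# synthetic, class-imbalanced data (not the HMEQ features themselves).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# ~20% positives, mirroring the BAD rate in the dataset
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

F1 is used here because accuracy alone is misleading on an imbalanced target; the same loop would run unchanged on the prepared HMEQ features.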
How can the bank use the model's insights to optimize its loan approval process?
Translating model predictions into actionable strategies for assessing loan applications can enhance decision-making efficiency and accuracy. This question is revisited in the conclusions, once the model results are available.
What measures will be taken to validate the model's predictions and assess its performance over time?
Establishing criteria for model evaluation and continuous improvement ensures the model remains accurate and relevant as new data becomes available. During training we can use cross-validation to assess the model's performance across different subsets of the data, which helps estimate its ability to generalize to unseen data, and a confusion matrix to evaluate performance in terms of precision, recall, accuracy, and F1 score. The latter is particularly important for classification models, where we must understand the trade-offs between different types of errors (false positives and false negatives).
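The evaluation plan above can be sketched as follows. The data is synthetic and the model is a placeholder logistic regression; the real evaluation would use the prepared HMEQ features:

```python
# Hedged sketch of the validation plan: cross-validation on the training split,
# then a confusion matrix and classification report on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)
print("5-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows: actual, cols: predicted
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```

Stratifying the split preserves the ~20% default rate in both partitions, which keeps the confusion-matrix cell counts representative.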
How will adverse actions (loan rejections) be communicated to ensure transparency and provide justification based on the model’s findings?
The model plays a crucial role in assisting the bank to address the issues of transparency and fairness in communicating adverse actions, like loan rejections, by providing clear, interpretable insights into the decision-making process. By leveraging interpretable modeling techniques and incorporating explanation frameworks, the model can identify and communicate the specific reasons contributing to a loan rejection, such as high debt-to-income ratios or insufficient credit history, in a manner that is understandable to applicants.
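One way such "reason codes" could be derived is from the per-feature contributions of a fitted logistic model. The sketch below uses a toy fit on random data; the feature names match HMEQ columns, but `top_reasons` is a hypothetical helper, not part of the final system:

```python
# Hedged sketch: ranking the features that pushed a specific applicant's
# prediction toward default, as candidate rejection reasons. Toy data only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["DEBTINC", "DELINQ", "DEROG", "CLAGE"]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
# Synthetic target loosely tied to the first features, for illustration
y = (X["DEBTINC"] + X["DELINQ"] - X["CLAGE"] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

def top_reasons(applicant_row, k=2):
    """Return the k features contributing most toward a default prediction."""
    contributions = model.coef_[0] * applicant_row.values
    order = np.argsort(contributions)[::-1]  # largest push toward default first
    return [features[i] for i in order[:k]]

print(top_reasons(X.iloc[0]))
```

For standardized inputs, coefficient-times-value is a reasonable first-order attribution; more rigorous frameworks (e.g., SHAP values) could replace it without changing the communication workflow.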
The overarching goal is to streamline and improve the loan approval process, making it more efficient, fair, and free from biases. By developing a classification model based on empirical data and statistical analysis, we seek to:
Enhance Decision-Making Accuracy: Automate the prediction of loan default risk with a high degree of accuracy, allowing the bank to make informed lending decisions.
Ensure Fairness and Compliance: Adhere to the Equal Credit Opportunity Act's guidelines, ensuring that the model's decisions are devoid of biases that could unfairly affect certain groups of applicants.
Improve Efficiency: Reduce the time and resources currently required for manual loan approval processes, thereby increasing operational efficiency.
Maintain Transparency: Build a model that is not only predictive but also interpretable, enabling the bank to provide clear justifications for loan approvals or rejections, thus maintaining transparency with applicants.
Identify Key Predictive Features: Determine the most significant factors that predict loan defaults, offering insights that can guide the bank's policies and strategies regarding loan approvals.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The type of job the loan applicant has, such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow).
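The debt-to-income calculation defined above is simple arithmetic; a small worked example (with made-up numbers) makes the scale of DEBTINC values in the dataset concrete:

```python
# Illustrative DEBTINC calculation; the dollar amounts are invented.
monthly_debt_payments = 1500.0   # e.g., mortgage + car + credit cards, per month
gross_monthly_income = 5000.0

debtinc = 100 * monthly_debt_payments / gross_monthly_income
print(f"DEBTINC = {debtinc:.1f}%")  # DEBTINC = 30.0%
```

So a DEBTINC around 34 (the dataset's median) means roughly a third of gross monthly income goes to debt service.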
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import winsorize
from google.colab import drive
drive.mount('/content/drive/')
Mounted at /content/drive/
data = pd.read_csv('/content/drive/My Drive/Colab_Notebooks/hmeq.csv')
data.shape
(5960, 13)
# get the first 5 rows of the data
data.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# get the last 5 rows of the data
data.tail()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
# get data datatypes and non-nulls
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
duplicates = data.duplicated()
duplicates
0 False
1 False
2 False
3 False
4 False
...
5955 False
5956 False
5957 False
5958 False
5959 False
Length: 5960, dtype: bool
There are no duplicates in the dataset
# percentage of null data per column
total_nulls = data.isnull().sum().sum()
print(f"Total null values in DataFrame: {total_nulls}")
null_percentage = data.isnull().mean() * 100
print("Null values percentage per column")
print(null_percentage)
Total null values in DataFrame: 5271
Null values percentage per column
BAD         0.000000
LOAN        0.000000
MORTDUE     8.691275
VALUE       1.879195
REASON      4.228188
JOB         4.681208
YOJ         8.640940
DEROG      11.879195
DELINQ      9.731544
CLAGE       5.167785
NINQ        8.557047
CLNO        3.724832
DEBTINC    21.258389
dtype: float64
msno.matrix(data)
plt.show()
Our dataset contains missing values across various columns, with the proportion of missing data ranging from approximately 1.88% to 21.26% across a total of 5,960 rows. This situation requires us to formulate certain assumptions about the nature and impact of these missing values.
Given the substantial presence of missing data, it's crucial to incorporate this consideration into our exploratory data analysis (EDA). We aim to understand the pattern of missingness, determining whether the data are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR). Additionally, we need to examine the inter-variable relationships of the missing data to prevent the introduction of biases. This careful approach ensures a more accurate analysis and interpretation of our dataset.
data.describe()
| | BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5960.000000 | 5960.000000 | 5442.000000 | 5848.000000 | 5445.000000 | 5252.000000 | 5380.000000 | 5652.000000 | 5450.000000 | 5738.000000 | 4693.000000 |
| mean | 0.199497 | 18607.969799 | 73760.817200 | 101776.048741 | 8.922268 | 0.254570 | 0.449442 | 179.766275 | 1.186055 | 21.296096 | 33.779915 |
| std | 0.399656 | 11207.480417 | 44457.609458 | 57385.775334 | 7.573982 | 0.846047 | 1.127266 | 85.810092 | 1.728675 | 10.138933 | 8.601746 |
| min | 0.000000 | 1100.000000 | 2063.000000 | 8000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.524499 |
| 25% | 0.000000 | 11100.000000 | 46276.000000 | 66075.500000 | 3.000000 | 0.000000 | 0.000000 | 115.116702 | 0.000000 | 15.000000 | 29.140031 |
| 50% | 0.000000 | 16300.000000 | 65019.000000 | 89235.500000 | 7.000000 | 0.000000 | 0.000000 | 173.466667 | 1.000000 | 20.000000 | 34.818262 |
| 75% | 0.000000 | 23300.000000 | 91488.000000 | 119824.250000 | 13.000000 | 0.000000 | 0.000000 | 231.562278 | 2.000000 | 26.000000 | 39.003141 |
| max | 1.000000 | 89900.000000 | 399550.000000 | 855909.000000 | 41.000000 | 10.000000 | 15.000000 | 1168.233561 | 17.000000 | 71.000000 | 203.312149 |
Preliminary Intuition behind the data
These are preliminary intuitions formed before analyzing the data in depth; they can guide decisions on how to handle outliers and missing values, and suggest how each variable might influence defaults.
BAD (Loan Default Indicator): This binary variable is the target outcome, indicating whether a loan was repaid or defaulted. It directly reflects the risk the bank is trying to mitigate.
LOAN (Amount of Loan Approved): Larger loan amounts may correlate with a higher risk of default due to the increased financial burden on the borrower.
MORTDUE (Amount Owed on Mortgage): A higher mortgage due could indicate financial strain, potentially increasing the likelihood of default, especially if it significantly outweighs the borrower's assets or income.
VALUE (Current Value of Property): Properties with higher values may indicate borrowers with more assets, possibly correlating with a lower default risk. However, market fluctuations can affect property values and, subsequently, this relationship.
REASON (Reason for Loan): typically categorized as "HomeImp" for home improvement or "DebtCon" for debt consolidation, can provide valuable context for assessing default risk. Loans taken out for home improvement ("HomeImp") might suggest an investment in the property's value, potentially indicating financial stability and planning. Conversely, loans for debt consolidation ("DebtCon") could indicate attempts to manage existing financial strain or overextension, which might carry a different risk profile.
JOB (Type of Job): The job type can provide insights into income stability and levels. Certain professions might inherently carry more stable income prospects, affecting default risk.
YOJ (Years at Present Job): Longer employment duration may suggest job stability, which could correlate with a lower risk of default due to consistent income.
DEROG (Number of Major Derogatory Reports): A higher number of derogatory marks on a borrower's credit report can be a strong predictor of default, reflecting past difficulties in managing credit.
DELINQ (Number of Delinquent Credit Lines): Similar to DEROG, a higher number of delinquent accounts may indicate trouble managing debt obligations, potentially predicting future default risk.
CLAGE (Age of Oldest Credit Line in Months): Older credit lines might imply a longer credit history and, possibly, more financial experience and stability, which could correlate with a lower risk of default.
NINQ (Number of Recent Credit Inquiries): A high number of recent inquiries could suggest financial distress or overextension, potentially increasing default risk.
CLNO (Number of Existing Credit Lines): This could be a double-edged sword; more credit lines might indicate creditworthiness and financial management skills but could also suggest potential overextension.
DEBTINC (Debt-to-Income Ratio): A higher ratio might indicate that a significant portion of the borrower's income is dedicated to debt repayment, potentially increasing the risk of default due to limited financial flexibility.
To begin our analysis, let's first review the current state of our dataset. It has been observed that there are missing values present across various variables. Given the significant proportion of these missing values, simply discarding observations with missing data is not an optimal approach, as it could introduce bias and potentially distort the underlying relationships between variables. Therefore, our next step will be to methodically examine each variable to understand both its distribution and the extent of its missing values. This approach will enable us to devise more informed strategies for handling these missing values effectively.
## Numerical data
numerical_columns = ["LOAN","MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ","CLNO", "DEBTINC"]
## Categorical data
categorical_columns = ["JOB", "REASON"]
def plot_distribution_and_boxplot(dataset, column_name):
# Calculate the percentage of missing data
missing_percentage = dataset[column_name].isnull().mean() * 100
# Creating the subplot structure
fig, axs = plt.subplots(2, 1, figsize=(10, 8), gridspec_kw={'height_ratios': [3, 1], 'hspace': 0.5})
# Histogram with Density Plot on the first subplot
sns.histplot(dataset[column_name].dropna(), kde=True, bins=30, color='skyblue', ax=axs[0])
axs[0].axvline(dataset[column_name].mean(), color='red', linestyle='--', linewidth=2)
axs[0].set_title(f'{column_name} Distribution - Missing Data: {missing_percentage:.2f}%')
axs[0].set_xlabel(column_name)
axs[0].set_ylabel('Frequency')
# Boxplot on the second subplot
sns.boxplot(x=dataset[column_name], color='lightblue', ax=axs[1], showmeans=True)
axs[1].set_title(f'{column_name} Boxplot')
axs[1].set_xlabel(column_name)
axs[1].set_ylabel('')
plt.show()
for column in numerical_columns:
plot_distribution_and_boxplot(data, column)
def plot_categorical_distribution(data, column_name):
# Calculate counts and percentages
counts = data[column_name].value_counts(dropna=False) # Include NaN values in the count
total = data.shape[0] # Total number of rows to consider NaN in calculation
percentages = 100 * counts / total
# Calculate missing data percentage
missing_percentage = 100 * data[column_name].isnull().sum() / total
# Function for formatting autopct
def autopct_format():
def my_format(pct):
total_count = int(round(pct*total/100.0))
# Adjust to consider NaN if included in counts with dropna=False
return '{:.1f}%\n({:d})'.format(pct, total_count)
return my_format # Return the function itself
# Plot
plt.figure(figsize=(8, 8))
plt.pie(counts, labels=counts.index, autopct=autopct_format(), startangle=140)
plt.title(f'{column_name} Distribution - Missing Data: {missing_percentage:.2f}%')
plt.show()
for column in ["BAD"] + categorical_columns:
plot_categorical_distribution(data, column)
These assumptions can help guide strategies for handling the missing data and understanding its potential impact on predictive modeling efforts:
Random vs. Non-Random Missing Data:
Assumption: The missingness in data like "MORTDUE" (mortgage due), "DEBTINC" (debt-to-income ratio), and others might not be entirely random. For instance, missing "DEBTINC" values could be more common in applicants with complex income sources that are harder to document, or missing "JOB" information could be linked to self-employed applicants who might categorize their employment differently.
Implication: If missingness is non-random, simply ignoring or removing these cases could introduce bias or affect the model's accuracy. Analyzing the pattern of missing data can offer insights into its nature and guide appropriate imputation strategies.
Missingness Related to Applicant Characteristics:
Assumption: Missing values in "DEROG" (derogatory reports) and "DELINQ" (delinquencies) might relate to applicants with newer credit histories who haven’t encountered situations leading to derogatory remarks or delinquencies, hence the lack of recorded incidents.
Implication: This assumption suggests that for some variables, the absence of data could itself be informative, potentially indicating lower risk profiles for certain applicants.
Impact of Missing Data on Predictive Power:
Assumption: Columns with higher percentages of missing data, such as "DEBTINC" with over 21% missing, could have a significant impact on the model’s ability to accurately predict loan defaults if the missing data is not adequately addressed.
Implication: High levels of missing data in key predictive variables necessitate careful consideration of imputation methods to preserve or enhance the model’s predictive accuracy.
Correlation Between Missingness and Other Variables:
Assumption: The occurrence of missing data in one variable may be related to the presence or absence of data in another. For example, missing "VALUE" data might coincide with missing "MORTDUE" information, possibly because both are related to the applicant's property.
Implication: Understanding correlations between missing data across variables can inform multivariate imputation techniques, which consider these relationships to fill in missing values more accurately.
Missing Data as a Separate Category:
Assumption: For categorical variables like "JOB" and "REASON", the missingness could be treated as a separate category during analysis, under the assumption that not providing this information may itself be indicative of certain borrower behaviors or characteristics.
Implication: This approach can preserve the data's structure and provide additional insights into how the absence of certain information relates to loan default risk.
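The "missingness as a separate category" idea in the last assumption is straightforward to apply. The sketch below operates on a tiny toy frame; doing the same on `data` for "JOB" and "REASON" would be analogous:

```python
# Hedged sketch: treating missing values of categorical columns as their own
# "Missing" category, so the absence of information is preserved as a signal.
import numpy as np
import pandas as pd

df = pd.DataFrame({"JOB": ["Office", np.nan, "Mgr"],
                   "REASON": [np.nan, "DebtCon", "HomeImp"]})

for col in ["JOB", "REASON"]:
    df[col] = df[col].fillna("Missing")

print(df["JOB"].value_counts())
```

After one-hot encoding, the "Missing" level becomes its own indicator column, letting the model learn whether withholding this information is associated with default risk.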
Let's address the missing data first; that will allow us to better comprehend the relationships between features, and we can start the univariate analysis at the same time.
msno.heatmap(data)
<Axes: >
# Visualize missingness using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Data Heatmap - Missing Values')
plt.xlabel('Variables')
plt.ylabel('Observations')
plt.show()
# Visualize missingness using a dendrogram
plt.figure(figsize=(10, 6))
msno.dendrogram(data)
plt.title('Dendrogram - Missing Values')
plt.show()
<Figure size 1000x600 with 0 Axes>
Understanding the patterns of missing data, namely Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), is essential for robust data analysis and modeling. MCAR means the probability of a value being missing is unrelated to any observed or unobserved data; MAR means missingness depends only on other observed variables; MNAR means it depends on the unobserved value itself.
This will help us choose the best imputation strategy for each feature.
Let's test each feature to characterize its missing-data pattern.
## Test whether missingness is associated with observed variables (evidence against MCAR)
# Function to perform Chi-Square Test for categorical variables
def chi_square_test(data, categorical_vars):
for var in categorical_vars:
crosstab = pd.crosstab(data['missing_indicator'], data[var], dropna=False)
chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"Chi-Square Test p-value for '{var}': {p}")
# Function to perform T-Test for numerical variables
def t_test(data, numerical_vars):
for var in numerical_vars:
group_missing = data[data['missing_indicator'] == 1][var]
group_not_missing = data[data['missing_indicator'] == 0][var]
t_stat, p_val = ttest_ind(group_missing, group_not_missing, nan_policy='omit')
print(f"T-Test p-value for '{var}': {p_val}")
def analyze_missingness(data, target_vars, numerical_vars, categorical_vars):
for target_var in target_vars:
print(f"Analyzing missingness for: {target_var}")
# Creating missing indicator for the target variable
data['missing_indicator'] = data[target_var].isnull().astype(int)
# Perform Chi-Square Test for categorical variables
chi_square_test(data, categorical_vars)
# Perform T-Test for numerical variables
t_test(data, numerical_vars)
# Cleanup by removing the missing indicator to prepare for the next iteration
data.drop('missing_indicator', axis=1, inplace=True)
# List of variables to analyze for missingness
target_vars = ['DEROG', 'DELINQ', 'NINQ', 'MORTDUE', 'YOJ']
analyze_missingness(data, target_vars, numerical_columns, categorical_columns)
Analyzing missingness for: DEROG
Chi-Square Test p-value for 'JOB': 1.5990906100488e-08
Chi-Square Test p-value for 'REASON': 0.027062226903309197
T-Test p-value for 'LOAN': 6.473199772544665e-10
T-Test p-value for 'MORTDUE': 0.05800547283711083
T-Test p-value for 'VALUE': 0.00431179403233319
T-Test p-value for 'YOJ': 0.027449058255107944
T-Test p-value for 'DEROG': nan
T-Test p-value for 'DELINQ': 5.044171424812875e-63
T-Test p-value for 'CLAGE': 0.007609702846965379
T-Test p-value for 'NINQ': 4.396373330994414e-17
T-Test p-value for 'CLNO': 1.7628567568539897e-09
T-Test p-value for 'DEBTINC': 0.7741292421448079
Analyzing missingness for: DELINQ
Chi-Square Test p-value for 'JOB': 5.109007838191516e-08
Chi-Square Test p-value for 'REASON': 0.9646562191000062
T-Test p-value for 'LOAN': 0.0029638658137338967
T-Test p-value for 'MORTDUE': 0.28484137204793886
T-Test p-value for 'VALUE': 5.491012738618064e-10
T-Test p-value for 'YOJ': 3.3645166413844434e-12
T-Test p-value for 'DEROG': 4.624383627816484e-35
T-Test p-value for 'DELINQ': nan
T-Test p-value for 'CLAGE': 0.013929240237087113
T-Test p-value for 'NINQ': 4.733497983306184e-14
T-Test p-value for 'CLNO': 0.2537841603573451
T-Test p-value for 'DEBTINC': 0.0005287782652992446
Analyzing missingness for: NINQ
Chi-Square Test p-value for 'JOB': 2.8732129247173023e-09
Chi-Square Test p-value for 'REASON': 0.0019763944090644683
T-Test p-value for 'LOAN': 1.7552912380770204e-07
T-Test p-value for 'MORTDUE': 0.10418233931265687
T-Test p-value for 'VALUE': 1.1583923297513604e-06
T-Test p-value for 'YOJ': 1.2209486897259112e-06
T-Test p-value for 'DEROG': 1.6037639234345464e-24
T-Test p-value for 'DELINQ': 7.70451409694177e-08
T-Test p-value for 'CLAGE': 0.4045675477500539
T-Test p-value for 'NINQ': nan
T-Test p-value for 'CLNO': 0.2959530795256866
T-Test p-value for 'DEBTINC': 5.228354749733804e-15
Analyzing missingness for: MORTDUE
Chi-Square Test p-value for 'JOB': 2.3821585508420545e-32
Chi-Square Test p-value for 'REASON': 2.075224022308458e-33
T-Test p-value for 'LOAN': 0.391394413953697
T-Test p-value for 'MORTDUE': nan
T-Test p-value for 'VALUE': 2.030110648485794e-46
T-Test p-value for 'YOJ': 0.33041270542226064
T-Test p-value for 'DEROG': 0.006338132538042123
T-Test p-value for 'DELINQ': 0.06293341574546857
T-Test p-value for 'CLAGE': 0.06302707121424661
T-Test p-value for 'NINQ': 0.0011816833564071351
T-Test p-value for 'CLNO': 2.3732824460988805e-77
T-Test p-value for 'DEBTINC': 6.456704925116695e-43
Analyzing missingness for: YOJ
Chi-Square Test p-value for 'JOB': 2.1408200333960915e-42
Chi-Square Test p-value for 'REASON': 0.0710290417732955
T-Test p-value for 'LOAN': 0.0002735686785907759
T-Test p-value for 'MORTDUE': 8.88376354789684e-08
T-Test p-value for 'VALUE': 1.0935616110950211e-11
T-Test p-value for 'YOJ': nan
T-Test p-value for 'DEROG': 8.826499217857786e-05
T-Test p-value for 'DELINQ': 0.002919192113766771
T-Test p-value for 'CLAGE': 1.6995749956210638e-10
T-Test p-value for 'NINQ': 0.0024950866646551127
T-Test p-value for 'CLNO': 2.692602511931492e-18
T-Test p-value for 'DEBTINC': 0.09096551714872332
# DEROG, DELINQ, NINQ
# Identifying missing values for NINQ, DELINQ, and DEROG
missing_ninq = data['NINQ'].isnull()
missing_delinq = data['DELINQ'].isnull()
missing_derog = data['DEROG'].isnull()
# Combine missing conditions
missing_any = missing_ninq | missing_delinq | missing_derog
missing_all = missing_ninq & missing_delinq & missing_derog
# Descriptive statistics for CLAGE where any or all of the NINQ, DELINQ, DEROG are missing
print("CLAGE where any of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_any, 'CLAGE'].describe())
print("\nCLAGE where all of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_all, 'CLAGE'].describe())
# Descriptive statistics for CLNO where any or all of the NINQ, DELINQ, DEROG are missing
print("CLNO where any of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_any, 'CLNO'].describe())
print("\nCLNO where all of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_all, 'CLNO'].describe())
print("CLAGE and CLNO where NINQ, DELINQ, DEROG data are not missing:")
print(data.loc[~missing_any, ['CLAGE', 'CLNO']].describe())
CLAGE where any of NINQ, DELINQ, DEROG are missing:
count 615.000000
mean 181.681224
std 97.253324
min 0.507115
25% 109.590773
50% 155.143161
75% 253.116902
max 485.945358
Name: CLAGE, dtype: float64
CLAGE where all of NINQ, DELINQ, DEROG are missing:
count 149.000000
mean 175.009541
std 79.750657
min 70.253395
25% 115.312749
50% 159.801493
75% 184.296687
max 354.735919
Name: CLAGE, dtype: float64
CLNO where any of NINQ, DELINQ, DEROG are missing:
count 615.000000
mean 22.452033
std 11.774921
min 4.000000
25% 13.000000
50% 20.000000
75% 29.000000
max 56.000000
Name: CLNO, dtype: float64
CLNO where all of NINQ, DELINQ, DEROG are missing:
count 149.000000
mean 20.194631
std 8.782413
min 8.000000
25% 13.000000
50% 19.000000
75% 24.000000
max 39.000000
Name: CLNO, dtype: float64
CLAGE and CLNO where NINQ, DELINQ, DEROG data are not missing:
CLAGE CLNO
count 5037.000000 5123.000000
mean 179.532467 21.157330
std 84.314438 9.916689
min 0.000000 0.000000
25% 116.614859 15.000000
50% 174.506408 20.000000
75% 230.242235 26.000000
max 1168.233561 71.000000
General Observations:
Significant p-values (<0.05) indicate a statistical relationship between the missingness of the target variable and the tested variable: the likelihood of observing the data under the null hypothesis (no association between the missingness of the target variable and the tested variable) is extremely low.
Non-significant p-values (≥0.05) suggest there is not enough evidence to conclude a relationship between the missingness of the target variable and the tested variable.
nan in the T-Test p-value results typically occurs when testing a variable against its own missingness, or when there is no variance in a group (e.g., all values are the same or missing).
"DEROG" Missingness: Strong associations are observed with "JOB", "REASON", "LOAN", "VALUE", "YOJ", "DELINQ", "CLAGE", "NINQ", and "CLNO". This widespread correlation suggests that the missingness in "DEROG" might be systematically related to both the applicant's job and reason for the loan, financial factors (loan amount, property value, years on the job), and other aspects of their credit history (delinquencies, credit inquiries, number of credit lines). The missingness here could be influenced by applicants' characteristics or might indicate a pattern where applicants with certain profiles are more likely to have or to omit this information.
"DELINQ" Missingness: The missingness shows strong associations with "JOB", "VALUE", "YOJ", "DEROG", "NINQ", and "DEBTINC". Similar to "DEROG", the pattern suggests a relationship between missingness and both employment-related factors and detailed financial variables, indicating possible profile similarities among those missing this data.
"NINQ" Missingness: Significant correlations with "JOB", "REASON", "LOAN", "VALUE", "YOJ", "DEROG", "DELINQ", and "DEBTINC" highlight how missingness in credit inquiries is linked to a wide range of factors, possibly pointing towards either data collection issues or specific applicant characteristics that lead to this missingness.
"MORTDUE" Missingness: Very strong associations with "JOB", "REASON", "VALUE", "NINQ", "CLNO", and "DEBTINC" indicate that the missing mortgage due amounts are not random and are particularly linked to the job, the reason for the loan, property values, and debt-to-income ratios, suggesting a pattern in the types of applicants or the conditions under which this data tends to be missing.
"YOJ" Missingness: Missing "YOJ" data shows significant links to "JOB", "LOAN", "MORTDUE", "VALUE", "DEROG", "DELINQ", "CLAGE", "NINQ", and "CLNO", implying that the missingness could be related to the applicants' job and financial details, including their credit history and loan characteristics.
In our data analysis, missingness in "NINQ", "DELINQ", and "DEROG" doesn't significantly alter the mean values of "CLAGE" (Age of Credit Line) and "CLNO" (Number of Credit Lines), suggesting other factors might contribute to the missing data. Hence, we considered Median Imputation and KNN Imputation for handling missing values in our predictive modeling phase.
Decision Rationale:
Considering the dataset's structure and the need to preserve inter-variable relationships, we opted for KNN imputation with k=5, aiming for accurate imputations while managing computational resources and model complexity. For "DEBTINC", we instead employed median imputation: its missingness appears random and its distribution is right-skewed, making the median a more representative fill value.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Keep a copy of the data before imputation
original_data = data.copy()
# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
# Select variables for imputation and relevant variables for KNN context
variables_to_impute = ['DEROG', 'DELINQ', 'NINQ', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'CLAGE', 'CLNO']
# Prepare data for imputation
data_to_impute = data[variables_to_impute].copy()
# Apply a log transformation to skewed variables, adding 1 to handle zeros
for col in ['DEROG', 'DELINQ', 'NINQ']:
    data_to_impute[col] = np.log1p(data_to_impute[col])
# Perform the imputation
imputed_data_log_transformed = imputer.fit_transform(data_to_impute)
# Convert the imputed numpy array back to a DataFrame
imputed_df_log_transformed = pd.DataFrame(imputed_data_log_transformed, columns=variables_to_impute)
# Reverse the log transformation
for col in ['DEROG', 'DELINQ', 'NINQ']:
    imputed_df_log_transformed[col] = np.expm1(imputed_df_log_transformed[col])
# Update the original DataFrame with the imputed values
data.update(imputed_df_log_transformed)
# Verify the imputation
print(data[variables_to_impute].describe())
DEROG DELINQ NINQ LOAN MORTDUE \
count 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000
mean 0.258066 0.479808 1.152833 18607.969799 72849.947455
std 0.800957 1.094570 1.665734 11207.480417 43153.211045
min 0.000000 0.000000 0.000000 1100.000000 2063.000000
25% 0.000000 0.000000 0.000000 11100.000000 46431.250000
50% 0.000000 0.000000 1.000000 16300.000000 64373.400000
75% 0.000000 0.643752 2.000000 23300.000000 89939.000000
max 10.000000 15.000000 17.000000 89900.000000 399550.000000
VALUE YOJ CLAGE CLNO
count 5960.000000 5960.000000 5960.000000 5960.000000
mean 101265.043217 8.907342 179.212184 21.225570
std 57256.005044 7.371312 84.143410 10.005451
min 8000.000000 0.000000 0.000000 0.000000
25% 65683.500000 3.000000 116.761588 15.000000
50% 88605.000000 7.000000 172.735145 20.000000
75% 119229.000000 13.000000 228.041251 26.000000
max 855909.000000 41.000000 1168.233561 71.000000
## handle DEBTINC
from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(strategy='median')
debtinc_reshaped = data['DEBTINC'].values.reshape(-1, 1)
data['DEBTINC'] = median_imputer.fit_transform(debtinc_reshaped)
## create missing category for categorical data
for c in categorical_columns:
    data[c].fillna('Unknown', inplace=True)
msno.matrix(data)
plt.show()
# Display summary statistics for the data
display("Median Imputed DataFrame Summary Statistics:", data.describe())
'Median Imputed DataFrame Summary Statistics:'
| | BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 | 5960.000000 |
| mean | 0.199497 | 18607.969799 | 72849.947455 | 101265.043217 | 8.907342 | 0.258066 | 0.479808 | 179.212184 | 1.152833 | 21.225570 | 34.000651 |
| std | 0.399656 | 11207.480417 | 43153.211045 | 57256.005044 | 7.371312 | 0.800957 | 1.094570 | 84.143410 | 1.665734 | 10.005451 | 7.644528 |
| min | 0.000000 | 1100.000000 | 2063.000000 | 8000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.524499 |
| 25% | 0.000000 | 11100.000000 | 46431.250000 | 65683.500000 | 3.000000 | 0.000000 | 0.000000 | 116.761588 | 0.000000 | 15.000000 | 30.763159 |
| 50% | 0.000000 | 16300.000000 | 64373.400000 | 88605.000000 | 7.000000 | 0.000000 | 0.000000 | 172.735145 | 1.000000 | 20.000000 | 34.818262 |
| 75% | 0.000000 | 23300.000000 | 89939.000000 | 119229.000000 | 13.000000 | 0.000000 | 0.643752 | 228.041251 | 2.000000 | 26.000000 | 37.949892 |
| max | 1.000000 | 89900.000000 | 399550.000000 | 855909.000000 | 41.000000 | 10.000000 | 15.000000 | 1168.233561 | 17.000000 | 71.000000 | 203.312149 |
## distribution of the completed dataset
for column in numerical_columns:
    plot_distribution_and_boxplot(data, column)
The distributions are not materially affected by the imputation.
What is the range of values for the loan amount variable "LOAN"?
How does the distribution of years at present job "YOJ" vary across the dataset?
How many unique categories are there in the REASON variable?
What is the most common category in the JOB variable?
# getting the minimum and maximum loan amounts
loan_min = data['LOAN'].min()
loan_max = data['LOAN'].max()
loan_range = loan_max - loan_min
print(f"Minimum Loan Amount: ${loan_min}")
print(f"Maximum Loan Amount: ${loan_max}")
print(f"Range of Loan Amounts: ${loan_range}")
Minimum Loan Amount: $1100.0 Maximum Loan Amount: $89900.0 Range of Loan Amounts: $88800.0
How does the distribution of years at present job "YOJ" vary across the dataset?
Average Tenure: The average job tenure is approximately 8.91 years, indicating a moderately long period spent by individuals in their current jobs. However, the median tenure of 7 years suggests that half of the dataset's individuals have stayed in their jobs for a shorter duration, pointing towards a majority with relatively recent job changes.
Variability: There's a wide range in job tenure, from 0 to 41 years, with a standard deviation of about 7.37 years. This indicates diverse job tenure experiences among the individuals in the dataset.
Distribution: The distribution of job tenures is right-skewed, as evidenced by a mean that is higher than the median and the presence of individuals with very long tenures that extend up to 41 years. This skewness suggests that while many individuals have shorter tenures, there's a significant number with exceptionally long tenures, pulling the average higher.
Quartile Insights: A quarter of the dataset's individuals have been in their current jobs for 3 years or less, while 75% have tenures of 13 years or less. This quartile distribution further underscores the skewness towards shorter job tenures within the dataset.
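The mean-above-median skewness check used above can be verified directly. A minimal sketch with a synthetic right-skewed tenure sample (standing in for `data['YOJ']`, which is not reproduced here):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed tenure sample, capped at 41 years like the dataset
rng = np.random.default_rng(42)
yoj = pd.Series(rng.gamma(shape=1.5, scale=6.0, size=5000)).clip(upper=41)

print(f"mean:   {yoj.mean():.2f}")
print(f"median: {yoj.median():.2f}")
print(f"skew:   {yoj.skew():.2f}")  # positive => right-skewed
```

A mean above the median together with a positive skew coefficient confirms the right skew described above.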
How many unique categories are there in the REASON variable?
Three:
HomeImp ("Home Improvement"), DebtCon ("Debt Consolidation"), and Unknown (the category we created for missing values).
What is the most common category in the JOB variable? "Other"
Leading questions
Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
Do applicants who default have a significantly different loan amount compared to those who repay their loan?
Is there a correlation between the value of the property and the loan default rate?
Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
# Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
default_proportions = data.groupby('REASON')['BAD'].mean().reset_index()
plt.figure(figsize=(8, 6))
sns.barplot(x='REASON', y='BAD', hue='REASON', data=default_proportions, palette='coolwarm', legend=False)
plt.xlabel('Reason for Loan')
plt.ylabel('Proportion of Defaults')
plt.title('Proportion of Loan Defaults by Reason')
plt.xticks(rotation=45) # Rotate category labels for better readability
plt.show()
Home improvement loans show a slightly higher proportion of defaults, but the difference is small; both categories behave similarly.
#Do applicants who default have a significantly different loan amount compared to those who repay their loan?
plt.figure(figsize=(10, 6))
sns.boxplot(x='BAD', y='LOAN', hue='BAD', data=data, palette='coolwarm', notch=True, width=0.5, legend=False)
plt.xticks([0, 1], ['Repaid', 'Defaulted']) # Setting custom labels for clarity
plt.title('Loan Amount Distribution by Repayment Status')
plt.ylabel('Loan Amount')
plt.xlabel('Status')
plt.show()
The data analysis does not provide conclusive evidence on whether the loan amount impacts the likelihood of loans being paid or defaulted. Both paid and defaulted loans exhibit similar means and contain outliers. Therefore, we cannot confidently conclude that loan amount is a determining factor for loan repayment or default.
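To move beyond visual inspection, a formal two-sample test could quantify whether the difference in loan amounts is significant. The sketch below uses a synthetic stand-in for `data` (a Mann-Whitney U test is chosen because the loan amounts are right-skewed):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in shaped loosely like the HMEQ columns
rng = np.random.default_rng(1)
data_demo = pd.DataFrame({
    "BAD": rng.choice([0, 1], size=2000, p=[0.8, 0.2]),
    "LOAN": rng.lognormal(mean=9.7, sigma=0.6, size=2000),
})

repaid = data_demo.loc[data_demo["BAD"] == 0, "LOAN"]
defaulted = data_demo.loc[data_demo["BAD"] == 1, "LOAN"]

# Mann-Whitney U: rank-based, robust to the right skew in loan amounts
u_stat, p_value = stats.mannwhitneyu(repaid, defaulted, alternative="two-sided")
print(f"Mann-Whitney U p-value: {p_value:.3f}")
```

A large p-value would support the "no clear difference" reading above; applied to the real `data`, the same two lines of slicing and testing would settle the question formally.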
#Is there a correlation between the value of the property and the loan default rate?
correlation_coefficient = data['VALUE'].corr(data['BAD'])
print(f"Correlation coefficient between VALUE and BAD: {correlation_coefficient}")
plt.figure(figsize=(10, 6))
sns.scatterplot(x='VALUE', y='BAD', data=data, hue='BAD', palette='coolwarm', alpha=0.6)
plt.title('Correlation between Property Value and Loan Default Rate')
plt.xlabel('Property Value')
plt.ylabel('Loan Default Status')
plt.yticks([0, 1], ['Repaid', 'Defaulted']) # Adjusting y-ticks for clarity
plt.show()
Correlation coefficient between VALUE and BAD: -0.04547284494145901
The scatter plot of "VALUE" against "BAD" yields a correlation coefficient of approximately -0.045, a very weak negative linear relationship, though some clustering of property values is visible for both classes.
These clusters do not imply a strong linear relationship between "VALUE" and repayment status; they may instead indicate value ranges that are common to both paid and defaulted loans. Analysis beyond correlation coefficients, such as examining distributions or exploring nonlinear relationships, may provide deeper insight into the association between property value and repayment status.
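One simple way to probe a possible nonlinear relationship is to bin the property value and compute the default rate per bin. A sketch, again with a synthetic stand-in for `data` (the dip in mid-range values is simulated purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "VALUE": rng.lognormal(mean=11.4, sigma=0.5, size=3000),
})
# Synthetic default flag whose rate dips for mid-range values (illustrative only)
mid = (demo["VALUE"] > 60_000) & (demo["VALUE"] < 150_000)
demo["BAD"] = rng.binomial(1, np.where(mid, 0.15, 0.30))

# Default rate per property-value decile
demo["value_bin"] = pd.qcut(demo["VALUE"], q=10)
rates = demo.groupby("value_bin", observed=True)["BAD"].mean()
print(rates)
```

On the real data, a non-flat pattern across bins would reveal structure that the near-zero Pearson correlation cannot capture.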
#Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
plt.figure(figsize=(10, 6))
sns.boxplot(x='BAD', y='MORTDUE', hue='BAD', data=data, palette='coolwarm', legend=False)
plt.xticks([0, 1], ['Repaid', 'Defaulted'])
plt.title('Mortgage Amount Distribution by Loan Repayment Status')
plt.xlabel('Loan Status')
plt.ylabel('Mortgage Amount')
plt.show()
The boxplot shows that defaulted loans are not clearly distinguishable from repaid loans; both mortgage-amount distributions are highly similar and largely overlap. This underscores the difficulty of predicting repayment status from the mortgage amount ("MORTDUE") alone; additional factors are needed to explain loan default.
def dist_boxplot(x, **kwargs):
    ax = sns.histplot(x, kde=False, color="skyblue", alpha=0.6, bins=30)
    # Create a twin axis to overlay a boxplot
    ax2 = ax.twinx()
    sns.boxplot(x=x, ax=ax2, width=0.5, fliersize=2)
    ax2.set(ylim=(-5, 5))
    # Give the boxplot a transparent background
    ax2.set_zorder(1)
    ax2.patch.set_visible(False)
# For 'CLAGE'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'CLAGE')
# For 'NINQ'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'NINQ')
# For 'CLNO'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'CLNO')
# For 'DEBTINC'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'DEBTINC')
plt.show()
From the plots, several observations can be drawn:
CLAGE (Age of Credit Line): defaulted loans tend to show somewhat younger oldest credit lines than repaid loans.
NINQ (Number of Inquiries): defaulters appear to have made slightly more recent credit inquiries.
CLNO (Number of Credit Lines): the distributions for the two groups largely overlap, with no pronounced difference.
DEBTINC (Debt-to-Income Ratio): defaulters are concentrated at somewhat higher ratios, with a heavier upper tail.
# Assuming 'data' is your DataFrame
sns.set_style("whitegrid")
# Count the observations for each category of 'BAD'
observation_counts = data['BAD'].value_counts().sort_index()
# Create a figure to hold the subplots
plt.figure(figsize=(14, 6))
# Plot for DEROG
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st subplot
sns.boxplot(x='BAD', y='DEROG', data=data)
# Adding observation count to the title
plt.title(f'Derogatory Reports by Loan Default Status\nCounts: {observation_counts.to_string()}')
# Plot for DELINQ
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplot
sns.boxplot(x='BAD', y='DELINQ', data=data)
# Adding observation count to the title
plt.title(f'Delinquent Credit Lines by Loan Default Status\nCounts: {observation_counts.to_string()}')
plt.tight_layout()
plt.show()
Borrowers who defaulted on their mortgage have more delinquent credit lines and major derogatory reports than those who did not.
sns.set_style("whitegrid")
# Create a figure to hold the subplots
plt.figure(figsize=(14, 6))
# Count the observations for each category of 'BAD'
observation_counts = data['BAD'].value_counts()
# Plot for DEBTINC
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st subplot
sns.boxplot(x='BAD', y='DEBTINC', data=data)
# Adding observation count to the title
plt.title(f'Debt-to-Income Ratio by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')
# Plot for LOAN
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplot
sns.boxplot(x='BAD', y='LOAN', data=data)
# Adding observation count to the title
plt.title(f'Loan Request Amount by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')
plt.tight_layout()
plt.show()
Debt-to-Income Ratio (DEBTINC) by Loan Default Status:
Loan Request Amount (LOAN) by Loan Default Status:
plt.figure(figsize=(8, 6))
observation_counts = data['BAD'].value_counts()
sns.boxplot(x='BAD', y='CLAGE', data=data)
plt.title(f'Age of Oldest Credit Line by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')
plt.xlabel('Loan Default Status')
plt.ylabel('Age of Oldest Credit Line (Months)')
plt.show()
Consistent with the correlation analysis, borrowers who repaid their loans tend to have older credit lines, reflecting longer-established credit histories with the bank.
Defaulting borrowers, by contrast, tend to have made more recent credit inquiries and to carry somewhat higher debt-to-income ratios, while the requested loan amount differs little between the two groups.
correlation_matrix = data.corr(numeric_only=True)
# Plotting heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.show()
Relationship with "BAD" (Loan Default Indicator)
DEROG (Number of Major Derogatory Reports) and DELINQ (Number of Delinquent Credit Lines) have the most substantial positive correlations with "BAD" at 0.26 and 0.33, respectively. This suggests that applicants with more derogatory reports and delinquencies are more likely to default on their loans.
CLAGE (Age of Oldest Credit Line in Months) shows a significant negative correlation (-0.16) with "BAD", indicating that longer credit histories are associated with a lower likelihood of default.
NINQ (Number of Recent Credit Inquiries) has a moderately positive correlation (0.17) with "BAD", suggesting that a higher number of recent credit inquiries might be associated with a higher risk of default, possibly reflecting financial stress or overextension.
DEBTINC (Debt-to-Income Ratio) also shows a positive correlation (0.15) with "BAD", indicating that higher debt relative to income might increase the likelihood of loan default.
Other Notable Correlations
MORTDUE (Amount Owed on Mortgage) and VALUE (Current Value of the Property) are highly correlated (0.78), as expected, since larger mortgages are typically associated with more valuable properties.
MORTDUE and CLNO (Number of Credit Lines) show a significant positive correlation (0.32), suggesting that individuals with higher mortgage amounts also tend to have more credit lines, possibly reflecting higher overall credit engagement or financial activity.
CLAGE shows a positive correlation with YOJ (Years at Present Job) (0.18), indicating that individuals with longer employment tenure also tend to have older credit lines, which could reflect overall financial stability.
Insights for Predictive Modeling
Variables like DEROG, DELINQ, NINQ, and DEBTINC could be key predictors in a model aimed at predicting loan defaults, given their significant correlations with "BAD".
The negative correlation of CLAGE with "BAD" highlights the importance of considering the age of credit history as a potential protective factor against default.
The lack of a strong correlation between LOAN size and default risk ("BAD") suggests that the amount of the loan itself is not as predictive of default as the borrower's credit history and current financial obligations.
Given our dataset's inherent right skewness and presence of outliers, it becomes imperative to normalize the data for effective analysis. To achieve this, we employed the np.log1p transformation as it aligns well with the characteristics of our data. This transformation not only addresses the skewness but also mitigates the impact of outliers, rendering the data more suitable for subsequent analysis.
# Create a new DataFrame for the log-transformed data
log_transformed_data = data.copy()
# Apply log transformation to all numerical columns
# Using np.log1p to handle columns with zeros by computing log(1 + x) for each element
for col in numerical_columns:
    log_transformed_data[col] = np.log1p(log_transformed_data[col])
log_transformed_data[numerical_columns].head()
| | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.003974 | 10.160491 | 10.571983 | 2.442347 | 0.000000 | 0.000000 | 4.557729 | 0.693147 | 2.302585 | 3.578458 |
| 1 | 7.170888 | 11.157022 | 11.133143 | 2.079442 | 0.000000 | 1.098612 | 4.810828 | 0.000000 | 2.708050 | 3.578458 |
| 2 | 7.313887 | 9.510519 | 9.723224 | 1.609438 | 0.000000 | 0.000000 | 5.013742 | 0.693147 | 2.397895 | 3.578458 |
| 3 | 7.313887 | 10.780655 | 10.934745 | 2.054124 | 0.277259 | 0.439445 | 4.718258 | 0.415888 | 2.610070 | 3.578458 |
| 4 | 7.438972 | 11.490690 | 11.626263 | 1.386294 | 0.000000 | 0.000000 | 4.546835 | 0.000000 | 2.708050 | 3.578458 |
for column in numerical_columns:
    plot_distribution_and_boxplot(log_transformed_data, column)
The exploratory data analysis (EDA) reveals several significant insights into factors influencing loan default risk. Derived from the correlation matrix, variables like DEROG, DELINQ, NINQ, and DEBTINC exhibit notable correlations with loan default ("BAD"), indicating that applicants with more derogatory reports, delinquent credit lines, recent credit inquiries, and higher debt-to-income ratios are more likely to default on loans. Conversely, CLAGE (Age of Oldest Credit Line) shows a negative correlation with default, suggesting that longer credit histories may mitigate default risk. Notably, loan size (LOAN) demonstrates a weaker correlation with default, indicating that other factors such as credit history and financial obligations play significant roles in predicting default. These findings underscore the importance of incorporating multiple variables, including credit history, recent financial behavior, and debt levels, in predictive modeling to accurately assess loan default risk.
## prepare data
X = log_transformed_data.drop('BAD', axis=1)
y = log_transformed_data['BAD']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Due to several outliers remaining after the log transformation, winsorize the training data for logistic regression
from scipy.stats.mstats import winsorize

X_train_winsorized = X_train.copy()
for column in numerical_columns:
    X_train_winsorized[column] = winsorize(X_train_winsorized[column], limits=[0.05, 0.05])
# Adjusted Preprocessing for Categorical Data
# Define the ColumnTransformer for categorical data only, since numerical data doesn't require further preprocessing
preprocessor = ColumnTransformer(transformers=[
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
], remainder='passthrough') # 'remainder=passthrough' to keep numerical columns without transformation
# Define the model
logistic_regression_model = LogisticRegression(max_iter=1000, random_state=42)
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('model', logistic_regression_model)])
model_pipeline.fit(X_train_winsorized, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat',
OneHotEncoder(handle_unknown='ignore'),
['JOB', 'REASON'])])),
('model', LogisticRegression(max_iter=1000, random_state=42))])
# Predictions
y_pred = model_pipeline.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy: 0.8196308724832215
Confusion Matrix:
[[873 54]
[161 104]]
Classification Report:
precision recall f1-score support
0 0.84 0.94 0.89 927
1 0.66 0.39 0.49 265
accuracy 0.82 1192
macro avg 0.75 0.67 0.69 1192
weighted avg 0.80 0.82 0.80 1192
The logistic regression model is achieving an accuracy rate of approximately 81.96%. The model excelled in identifying loans that were paid off (class 0), with a precision of 84% and a high recall of 94%, yielding an F1-score of 89%. This indicates a strong capability in accurately predicting loans that would be repaid. However, the model's performance in detecting defaulted loans (class 1) was less effective, evidenced by a recall of 39% and precision of 66%, leading to a modest F1-score of 49%. These results highlight a significant challenge in correctly identifying default cases, which is critical for risk assessment and mitigation in financial lending.
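Before switching models, one common lever for the weak class-1 recall is lowering the decision threshold on the predicted default probabilities. The sketch below is illustrative only, using a small synthetic imbalanced problem rather than the loan pipeline itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced problem standing in for the loan data (~20% positives)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]            # P(default)
pred_default_50 = (proba >= 0.5).astype(int)  # default threshold
pred_default_30 = (proba >= 0.3).astype(int)  # lowered threshold

# Lowering the threshold flags more applicants as likely defaulters,
# raising recall on class 1 at the cost of precision
print("recall @0.5:", recall_score(y, pred_default_50))
print("recall @0.3:", recall_score(y, pred_default_30))
```

The same `predict_proba` trick would apply to `model_pipeline`; the threshold becomes a business knob balancing missed defaults against wrongly rejected applicants.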
Let's try another model.
decision_tree_model = DecisionTreeClassifier(random_state=42, class_weight={0: 0.3, 1: 0.7})  # weight the default class (1) more heavily
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('model', decision_tree_model)])
model_pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat',
OneHotEncoder(handle_unknown='ignore'),
['JOB', 'REASON'])])),
('model',
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7},
random_state=42))])
y_pred = model_pipeline.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy: 0.8691275167785235
Confusion Matrix:
[[867 60]
[ 96 169]]
Classification Report:
precision recall f1-score support
0 0.90 0.94 0.92 927
1 0.74 0.64 0.68 265
accuracy 0.87 1192
macro avg 0.82 0.79 0.80 1192
weighted avg 0.86 0.87 0.87 1192
The Decision Tree model achieved an impressive accuracy of approximately 87%. The model demonstrated strong performance in identifying loans that were successfully repaid (class 0), with a precision of 90% and a recall of 94%, resulting in an F1-score of 92%. This indicates a high reliability in predicting non-default cases. For defaulted loans (class 1), the model also showed commendable results with a precision of 74%, a recall of 64%, and an F1-score of 68%, suggesting a solid ability to identify default cases, albeit with room for improvement. The overall model accuracy and the detailed performance metrics underscore the Decision Tree's effectiveness in distinguishing between paid and defaulted loans. The weighted average F1-score of 87% reflects the model's robustness across both classes. However, the slightly lower recall for class 1 highlights an area for potential enhancement, aiming to better capture defaulted loans without significantly compromising the precision.
Let's see if we can get better results by tuning the tree.
decision_tree_model.get_depth()
24
def evaluate_model_depth_with_test(X_train, y_train, X_test, y_test, depth_range, cv_folds=5):
    """Evaluates a Decision Tree model over a range of depths using both cross-validation
    on the training set and evaluation on the test set.

    Args:
        X_train (DataFrame): Training features.
        y_train (Series): Training target variable.
        X_test (DataFrame): Test features.
        y_test (Series): Test target variable.
        depth_range (range): Range of depths to evaluate.
        cv_folds (int): Number of folds for cross-validation.

    Returns:
        DataFrame: Average cross-validated scores for each depth and test set errors.
    """
    cv_scores = []
    test_errors = []
    for depth in depth_range:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42, class_weight={0: 0.3, 1: 0.7})
        model_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', clf)])
        # Cross-validation on training data; for 0/1 predictions, mean squared error
        # equals the misclassification rate, so neg_mean_squared_error works as an error score
        scores = cross_val_score(model_pipeline, X_train, y_train, cv=cv_folds, scoring='neg_mean_squared_error')
        cv_scores.append(np.mean(scores))
        model_pipeline.fit(X_train, y_train)
        y_test_pred = model_pipeline.predict(X_test)
        test_error = mean_squared_error(y_test, y_test_pred)
        test_errors.append(test_error)
    return pd.DataFrame({
        'Max Depth': list(depth_range),
        'CV Score': cv_scores,
        'Test Error': test_errors
    })
depth_range = range(1, 24)
results = evaluate_model_depth_with_test(X_train, y_train, X_test, y_test, depth_range)
# Negate the neg-MSE CV scores to obtain positive errors; lower is better for both columns
results['CV Misclassification Error'] = -results['CV Score']
results['Test Misclassification Error'] = results['Test Error']
# Plotting
plt.figure(figsize=(12, 8))
plt.plot(results['Max Depth'], results['CV Misclassification Error'], marker='o', linestyle='-', color='b', label='CV Misclassification Error')
plt.plot(results['Max Depth'], results['Test Misclassification Error'], marker='s', linestyle='--', color='r', label='Test Misclassification Error')
plt.title('Max Depth vs. Misclassification Error')
plt.xlabel('Max Depth')
plt.ylabel('Misclassification Error')
plt.grid(True)
plt.xticks(depth_range)
plt.legend()
plt.show()
Around depth 11, the test misclassification error reaches a minimum and the cross-validated error is near its lowest, representing a good balance between model complexity and generalization ability.
Criterion {“gini”, “entropy”}
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
You can learn about more hyperparameters in the scikit-learn documentation and try tuning them:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
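As a quick, self-contained illustration (on synthetic data, not the loan dataset) of how min_samples_leaf smooths the model, one can compare the size of trees fit with different values:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

small_leaf = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X, y)
big_leaf = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Requiring more samples per leaf prunes away fine-grained splits
print("nodes (min_samples_leaf=1): ", small_leaf.tree_.node_count)
print("nodes (min_samples_leaf=20):", big_leaf.tree_.node_count)
```

The larger leaf requirement yields a markedly smaller tree, which is the smoothing effect the grid search exploits.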
param_grid = {
    'model__max_depth': [9, 10, 11, 12, 13, 14],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__min_samples_leaf': [1, 2, 4, 6, 10],
    'model__class_weight': [{0: 0.3, 1: 0.7}, 'balanced'],
}
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat',
OneHotEncoder(handle_unknown='ignore'),
['JOB',
'REASON'])])),
('model',
DecisionTreeClassifier(class_weight={0: 0.2,
1: 0.8},
random_state=42))]),
n_jobs=-1,
param_grid={'model__class_weight': [{0: 0.3, 1: 0.7}, 'balanced'],
'model__max_depth': [9, 10, 11, 12, 13, 14],
'model__min_samples_leaf': [1, 2, 4, 6, 10],
'model__min_samples_split': [2, 5, 10, 20]},
scoring='accuracy', verbose=1)
# Best hyperparameters
print("Best hyperparameters:\n", grid_search.best_params_)
# Best model's score
print("Best model's accuracy:", grid_search.best_score_)
best_model = grid_search.best_estimator_
Best hyperparameters:
{'model__class_weight': {0: 0.3, 1: 0.7}, 'model__max_depth': 9, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best model's accuracy: 0.8794047705469431
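Beyond best_params_, the fitted search object records the cross-validated score of every candidate in cv_results_, which is useful for seeing how close the runners-up were. A self-contained sketch on synthetic data (swap in the grid_search fitted above for the real ranking):

```python
# Sketch: rank all GridSearchCV candidates, not just the winner.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  {'max_depth': [3, 5, 7]}, cv=3, scoring='accuracy')
gs.fit(X, y)

# cv_results_ holds the mean/std test score for every parameter combination.
results = (pd.DataFrame(gs.cv_results_)
             [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
             .sort_values('rank_test_score'))
print(results.head())
```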
# Predictions with the best model
y_pred = best_model.predict(X_test)
# Evaluation
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.8808724832214765
Confusion Matrix:
[[858 69]
[ 73 192]]
Classification Report:
precision recall f1-score support
0 0.92 0.93 0.92 927
1 0.74 0.72 0.73 265
accuracy 0.88 1192
macro avg 0.83 0.83 0.83 1192
weighted avg 0.88 0.88 0.88 1192
The hyperparameter-tuned Decision Tree model, optimized via GridSearch, demonstrated a notable increase in performance, achieving an accuracy of 88.09% on the test set, up from 86.66% with the initial model. This refined model significantly improved in detecting defaulted loans (Class 1), with recall rising from 64% to 72% and precision at 74%, indicating a stronger ability to identify defaults accurately. The overall balanced performance across classes also improved, as evidenced by the increase in macro average F1-score from 80% to 83%. This optimization underscores the effectiveness of hyperparameter tuning in enhancing model accuracy and balance, particularly in addressing class imbalances and improving default detection, making it a more reliable tool for financial risk mitigation.
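One caveat on the tuning above: the grid was scored with accuracy, which on an imbalanced target can reward models that under-detect defaults. A hedged alternative, sketched below on synthetic imbalanced data, is to rerun the same grid with scoring='f1' (or 'recall') so candidates are selected by class-1 performance instead:

```python
# Sketch: grid search scored by F1 of the positive (default) class.
# Self-contained toy version; in the notebook, reuse model_pipeline and
# param_grid from above with scoring='f1'.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  {'max_depth': [3, 6, 9], 'class_weight': [None, 'balanced']},
                  cv=5, scoring='f1')   # score each fold by F1 of class 1
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))
```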
decision_tree_model = best_model.named_steps['model']
def get_transformed_feature_names(column_transformer, input_features):
    """
    Get feature names from a ColumnTransformer that includes one-hot encoding
    and passthrough transformers.

    Args:
        column_transformer: The ColumnTransformer instance.
        input_features: The original feature names (as a list).

    Returns:
        A list of the transformed feature names.
    """
    transformed_feature_names = []
    for transformer_name, transformer_instance, columns in column_transformer.transformers_:
        # Handling for OneHotEncoder (or any transformer exposing get_feature_names_out)
        if hasattr(transformer_instance, 'get_feature_names_out'):
            if columns is not None and transformer_name != 'remainder':
                transformed_feature_names.extend(transformer_instance.get_feature_names_out(columns))
            else:
                transformed_feature_names.extend(transformer_instance.get_feature_names_out())
        # Handling for passthrough columns (referenced by integer index)
        elif transformer_name == 'remainder' and transformer_instance == 'passthrough':
            remainder_columns = [input_features[i] for i in columns] if isinstance(columns, list) else input_features
            transformed_feature_names.extend(remainder_columns)
    return transformed_feature_names
input_features = list(X_train.columns)
transformed_feature_names = get_transformed_feature_names(preprocessor, input_features)
plt.figure(figsize=(40, 40))
plot_tree(decision_tree_model, feature_names=transformed_feature_names, filled=True, class_names=['Paid', 'Defaulted'], max_depth=14, fontsize=10, rounded=True, precision=2, proportion=False)
plt.title('Decision Tree for Loan Default Prediction', fontsize=20)
#plt.savefig('tree_high_res.png', dpi=300, bbox_inches='tight') # Save to file
plt.show()
Key insights from the decision tree:
Primary Split on Debt-to-Income Ratio (DEBTINC): The tree's initial split is based on the DEBTINC feature, indicating its significant role in predicting loan outcomes. A DEBTINC threshold of 3.58 separates the data, suggesting that lower debt-to-income ratios are associated with a higher likelihood of loan repayment.
Importance of Delinquency (DELINQ) and Number of Inquiries (NINQ): Following DEBTINC, the tree frequently utilizes DELINQ and NINQ for further splits. This underscores the importance of past payment behavior and recent credit inquiries in assessing loan default risk.
Role of Loan Reasons (REASON_HomeImp): The decision to split based on whether the loan is for home improvement (REASON_HomeImp) indicates that the purpose of the loan has predictive value, with different default risks associated with home improvement loans versus other reasons.
Value of Collateral (VALUE) and Number of Credit Lines (CLNO): These features appear in various splits, suggesting that the collateral's value and the borrower's existing credit lines are relevant factors in predicting loan performance.
Interaction of Features: The tree structure reveals complex interactions between features. For example, within certain ranges of DEBTINC and DELINQ, other variables like VALUE, NINQ, and CLNO come into play, indicating that the risk of default is contingent on multiple factors.
Thresholds for Different Features: The specific thresholds used for splits (e.g., DELINQ <= 1.05, VALUE > 10.08) provide insights into critical values that distinguish between likely loan repayment and default. These thresholds can inform risk assessment strategies.
Subgroup Specific Patterns: The tree highlights specific patterns for subgroups within the data. For instance, within loans with a particular DEBTINC and DELINQ profile, further distinctions based on MORTDUE, CLAGE (age of oldest credit line), and other variables indicate nuanced patterns of risk.
Sensitivity to Certain Conditions: The model's branches reflect sensitivity to certain conditions, like higher DEBTINC combined with specific levels of DEROG (derogatory reports) and DELINQ, significantly influencing the prediction of default.
Comparing this decision tree's structure and insights with the initial, unoptimized model reveals the impact of hyperparameter tuning, particularly in terms of identifying meaningful splits and interactions that might not have been as clearly captured previously. The tuned model likely offers a more nuanced understanding of factors influencing loan defaults, enhancing its predictive accuracy and providing a solid basis for informed decision-making in loan approval processes.
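A depth-9 tree plot is hard to read in full; export_text prints the same splits as nested if/else rules, which is convenient for documenting individual rejection reasons. Self-contained toy sketch below; in the notebook, pass decision_tree_model and transformed_feature_names instead:

```python
# Sketch: render decision-tree splits as text rules with export_text.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Feature names here are placeholders; use the real transformed names.
rules = export_text(tree, feature_names=[f'f{i}' for i in range(4)])
print(rules)
```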
transformed_feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
importances = decision_tree_model.feature_importances_
feature_importances = pd.DataFrame({'Feature': transformed_feature_names, 'Importance': importances})
# Sort the DataFrame to display the most important features at the top
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(20)) # Display top 20 for clarity
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
Dominance of Debt-to-Income Ratio (DEBTINC): With a feature importance score of approximately 64.76%, the debt-to-income ratio is by far the most influential factor in predicting loan outcomes. This underscores the critical role of borrowers' financial health and their ability to manage existing debt relative to their income in determining loan repayment capabilities.
Significance of Payment History and Credit Age: Following DEBTINC, the delinquency (DELINQ) and age of the oldest credit line (CLAGE) are the next most important features, with importance scores around 5.91% and 5.75%, respectively. This highlights the importance of borrowers' payment history and the maturity of their credit history in assessing default risk.
Relevance of Employment and Loan Attributes: Features like years at the current job (YOJ), the number of credit lines (CLNO), derogatory reports (DEROG), property value (VALUE), loan amount (LOAN), and mortgage due (MORTDUE) also play significant roles, albeit to a lesser extent compared to DEBTINC, DELINQ, and CLAGE. These factors collectively capture aspects of borrowers' stability, credit behavior, and collateral value.
Limited Influence of Job Type and Loan Reason: Specific job categories (e.g., Self, Office, Mgr) and the reason for the loan (Home Improvement) have much lower importance scores, indicating a relatively minor direct influence on loan default predictions in this model. Notably, some job types (ProfExe, Sales, Other) and one loan reason (DebtCon) show zero importance, suggesting these features do not contribute to the model's decision-making process in this dataset.
Non-uniform Distribution of Feature Importance: The distribution of importance scores is highly skewed towards a few key features, particularly DEBTINC, emphasizing the model's reliance on a subset of highly predictive attributes over others.
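Impurity-based feature_importances_ can overstate continuous or high-cardinality features, so a useful cross-check is permutation importance computed on held-out data. The sketch below is self-contained on synthetic data; in the notebook, pass best_model, X_test, and y_test instead:

```python
# Sketch: permutation importance as a sanity check on impurity importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the score drop.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f'feature {i}: {result.importances_mean[i]:.3f} '
          f'+/- {result.importances_std[i]:.3f}')
```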
Random Forest is a bagging (bootstrap aggregation) algorithm whose base models are decision trees. Each tree is trained on a bootstrap sample drawn from the training data and makes its own prediction.
The predictions from all the trees are then combined into a final prediction by majority voting (for classification) or averaging (for regression).
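The bootstrap-plus-voting idea described above can be sketched by hand in a few lines. This minimal version omits the per-split feature subsampling that RandomForestClassifier adds on top of plain bagging:

```python
# Sketch: manual bagging of decision trees -- bootstrap samples + majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Majority vote across the 25 trees.
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print('training accuracy of the vote:', (ensemble_pred == y).mean())
```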
# Define the Random Forest model
random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
# Create a pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('model', random_forest_model)])
model_pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median')),
('scaler',
StandardScaler())]),
['LOAN', 'MORTDUE', 'VALUE',
'YOJ', 'DEROG', 'DELINQ',
'CLAGE', 'NINQ', 'CLNO',
'DEBTINC']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['JOB', 'REASON'])])),
                ('model', RandomForestClassifier(random_state=42))])
# Make predictions
y_pred = model_pipeline.predict(X_test)
# Evaluate the model
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.910234899328859
Confusion Matrix:
[[909 18]
[ 89 176]]
Classification Report:
precision recall f1-score support
0 0.91 0.98 0.94 927
1 0.91 0.66 0.77 265
accuracy 0.91 1192
macro avg 0.91 0.82 0.86 1192
weighted avg 0.91 0.91 0.90 1192
The performance of the Random Forest Classifier on the same dataset demonstrates a significant enhancement in predicting loan repayment outcomes, achieving an accuracy of 91.02% on the test set. This improvement over the decision tree model's earlier performance is notable in several key areas:
Increased Overall Accuracy: The Random Forest model exhibits a higher overall accuracy (91.02%) compared to the decision tree's accuracy (approximately 88.09%). This indicates a superior ability to correctly classify both paid and defaulted loans.
Higher Precision, Lower Recall for Defaulted Loans: Precision for defaulted loans (class 1) rises sharply to 91%, up from the tuned decision tree's 74%, meaning far fewer paid loans are wrongly flagged as defaults. Recall for class 1, however, drops to 66% from the tuned tree's 72%, so the forest misses a somewhat larger share of actual defaults. This trade-off is consistent with the ensemble's variance reduction, which curbs overfitting at some cost to sensitivity on the minority class.
High Precision and Recall for Paid Loans: The Random Forest model maintains high precision (91%) and an impressive recall (98%) for predicting paid loans (class 0), suggesting a strong capability in identifying loans that will not default. Class 0 recall in particular improves from the tuned decision tree's 93% to 98%, at essentially the same precision.
Balanced Performance Across Classes: The macro averages for precision, recall, and F1-score show that the Random Forest model offers a more balanced performance across both classes compared to the decision tree. This balance is crucial for practical applications where both identifying defaults accurately and minimizing false positives are important.
Enhanced F1-Scores Indicate Model Robustness: The F1-scores, which balance precision and recall, are higher for both classes in the Random Forest model, especially notable in the increased F1-score for defaulted loans (class 1) to 77%. This improvement suggests that the Random Forest model is not only accurate but also robust, providing reliable predictions across diverse scenarios.
In conclusion, the Random Forest Classifier's superior performance reflects its robustness and effectiveness in predicting loan repayment outcomes, making it a preferable choice over a single decision tree for tasks involving complex datasets with imbalanced classes.
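The 66% default recall above reflects the default 0.5 probability cutoff; if missed defaults are costlier than false alarms, lowering the threshold on predict_proba trades precision for recall without retraining. Self-contained sketch on synthetic imbalanced data; in the notebook, use model_pipeline.predict_proba(X_test):

```python
# Sketch: shifting the decision threshold to raise recall on defaults.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]   # estimated P(default)
for threshold in (0.5, 0.35, 0.2):
    pred = (proba >= threshold).astype(int)
    print(f'threshold={threshold}: '
          f'precision={precision_score(y_te, pred, zero_division=0):.2f} '
          f'recall={recall_score(y_te, pred):.2f}')
```

Recall is monotone non-decreasing as the threshold falls, so this gives a direct dial for the bank's tolerance for missed defaults versus wrongly rejected applicants.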
param_grid = {
'model__n_estimators': [100, 200],
'model__max_depth': [None, 10, 20],
#'model__min_samples_split': [2, 5, 10],
'model__min_samples_leaf': [1, 2, 4, ],
'model__class_weight': [None, 'balanced']
}
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # class_weight here favors recall on defaults, but note that the
    # 'model__class_weight' entries in param_grid override this setting
    # during the grid search.
    ('model', RandomForestClassifier(random_state=42, class_weight={0: 1, 1: 3}))
])
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Evaluate the best model found by GridSearch
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best model's accuracy:", accuracy_score(y_test, y_pred))
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best model's accuracy: 0.9085570469798657
# Evaluate the model
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.9085570469798657
Confusion Matrix:
[[901 26]
[ 83 182]]
Classification Report:
precision recall f1-score support
0 0.92 0.97 0.94 927
1 0.88 0.69 0.77 265
accuracy 0.91 1192
macro avg 0.90 0.83 0.86 1192
weighted avg 0.91 0.91 0.90 1192
random_forest_model = best_model.named_steps['model']
feature_importances = random_forest_model.feature_importances_
transformed_feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
feature_importances_df = pd.DataFrame({
'Feature': transformed_feature_names,
'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df.head(20))
plt.title('Top 20 Feature Importances in Random Forest Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
The feature importances extracted from the hyperparameter-tuned Random Forest model provide valuable insights into the factors influencing loan repayment predictions. Here are some conclusions and observations:
Dominant Role of Debt-to-Income Ratio (DEBTINC): The most critical feature influencing loan repayment predictions is the borrower's debt-to-income ratio, accounting for approximately 22.15% of the model's decision-making. This highlights the paramount importance of assessing borrowers' financial health and their ability to manage debt relative to their income.
Significance of Credit History: Features related to borrowers' credit history, including the age of the oldest credit line (CLAGE), delinquency records (DELINQ), and derogatory reports (DEROG), are among the top influencers. These attributes collectively underscore the relevance of a borrower's past credit behavior in predicting loan defaults, with CLAGE and DELINQ, in particular, being nearly equally important after DEBTINC.
Loan Amount and Property Value: The loan amount (LOAN) and the property value (VALUE) significantly influence predictions, indicating that the size of the loan and the value of the collateral are key factors in assessing loan risk.
Credit Lines and Mortgage Due: The number of credit lines (CLNO) and the mortgage due (MORTDUE) also play substantial roles, suggesting that the breadth of a borrower's credit relationships and the outstanding mortgage amount are pertinent to their likelihood of loan repayment.
Years at Job and Number of Inquiries: The years at the current job (YOJ) and the number of recent credit inquiries (NINQ) demonstrate a meaningful impact, albeit to a lesser extent compared to the top factors. These features reflect the stability of the borrower's employment and their recent search for credit, which can influence their repayment capability.
Influence of Job Type and Loan Reason: Categorical variables related to the borrower's job type and the reason for the loan (DebtCon for debt consolidation, HomeImp for home improvement) show lower importance scores. However, their presence in the list indicates that these aspects, while not as critical as financial metrics, still provide valuable context for predicting loan outcomes.
Minor Variations Among Job Types and Loan Reasons: The relatively close importance scores among different job types and loan reasons suggest a nuanced effect of these factors on loan repayment predictions. The model does not heavily favor one specific job type or loan reason over another, indicating a more balanced consideration of these attributes.
1. Comparison of the various techniques and their relative performance based on the chosen metric (measure of success):
| Model | Accuracy | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | Weighted F1 |
|---|---|---|---|---|---|---|
| Logistic Regression | 81.96% | 84% | 66% | 94% | 39% | 0.80 |
| Decision Tree | 86.91% | 90% | 74% | 94% | 64% | 0.87 |
| Hypertuned Decision Tree | 88.09% | 92% | 74% | 93% | 72% | 0.88 |
| Random Forest | 91.02% | 91% | 91% | 98% | 66% | 0.90 |
| Hypertuned Random Forest | 90.86% | 92% | 88% | 97% | 69% | 0.90 |
Overall, the Random Forest model, both basic and hypertuned, performs relatively better compared to other techniques, with higher accuracy and better balance between precision and recall for both classes.
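A comparison table like the one above can be assembled programmatically from classification_report(..., output_dict=True) rather than copied by hand, which avoids transcription errors. Self-contained sketch with two toy models; extend the dict with each fitted pipeline from the notebook:

```python
# Sketch: build the model-comparison table from classification_report dicts.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {'Logistic Regression': LogisticRegression(max_iter=1000),
          'Random Forest': RandomForestClassifier(random_state=42)}
rows = []
for name, model in models.items():
    rep = classification_report(y_te, model.fit(X_tr, y_tr).predict(X_te),
                                output_dict=True, zero_division=0)
    rows.append({'Model': name,
                 'Accuracy': rep['accuracy'],
                 'Recall (Class 1)': rep['1']['recall'],
                 'F1 (weighted)': rep['weighted avg']['f1-score']})
print(pd.DataFrame(rows).round(3))
```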
2. Refined insights:
Critical Features Across Models: The debt-to-income ratio (DEBTINC) dominates every model, with delinquencies (DELINQ), age of the oldest credit line (CLAGE), and derogatory reports (DEROG) consistently ranking next, confirming that a borrower's debt burden and payment history drive default risk.
Model Performance and Complexity: Predictive power rises with model complexity, from logistic regression through the tuned decision tree to the random forest, but so does the difficulty of explaining individual decisions, so the accuracy gain must be weighed against the interpretability requirements of regulations such as the ECOA.
3. Proposal for the final solution design:
Based on the analysis and results obtained from the logistic regression, decision tree, and random forest models, we propose adopting the Random Forest model as the final solution for predicting loan repayment outcomes. The decision is rooted in several key factors that make it particularly suited for this problem:
Balanced Performance: It achieves the highest accuracy (about 91%) while keeping precision and recall reasonably balanced across both the paid and defaulted classes.
Handling of Complex Relationships: As an ensemble of trees, it captures nonlinear relationships and feature interactions (for example, between DEBTINC, DELINQ, and VALUE) that a single linear model misses.
Robustness to Overfitting: Averaging over many bootstrap-trained trees reduces variance, so the model generalizes better than an individual deep decision tree.
Importance of Features: Its feature importance rankings provide transparent, actionable evidence of which borrower attributes drive predictions, supporting explainable rejection decisions.
Flexibility and Scalability: It retrains easily on new data, accepts class weights to address imbalance, and parallelizes well, making it practical to operate at production scale.
Conclusion:
The Random Forest model strikes an effective balance between accuracy, interpretability, and operational feasibility, making it the best solution among the models considered. Its superior performance metrics, coupled with its robustness and flexibility, position it as a valuable tool for predicting loan repayment outcomes. Adopting this model can enhance the ability of financial institutions to assess loan risk more accurately, leading to better lending decisions, reduced default rates, and potentially more competitive financial products. Continuous monitoring and periodic re-tuning of the model will ensure its relevance and effectiveness in the dynamic landscape of credit risk assessment.
Objective and Key Findings
Our comprehensive analysis aimed at enhancing loan default prediction models has led to significant insights and the identification of a robust predictive tool. Key findings include:
Critical Predictive Factors: The Debt-to-Income Ratio (DEBTINC) was identified as the most influential predictor of loan defaults. Other critical factors include credit history attributes like the age of the oldest credit line (CLAGE), delinquencies (DELINQ), and derogatory reports (DEROG). Loan amount (LOAN) and property value (VALUE) were also significant, affecting the risk assessment outcomes.
Optimal Model Selection: Among various models tested, the Random Forest model outperformed others, exhibiting superior accuracy and balance between recall and precision. This model effectively captures complex data relationships and provides robust predictability without overfitting.
Insightful Data Visualizations: Visual analyses reinforced the quantitative findings, showing clear distinctions in financial behaviors between defaulting and non-defaulting applicants. These insights are crucial for understanding the underlying patterns that influence loan outcomes.
Final Model Specifications
The chosen Random Forest Classifier demonstrates excellent capability in handling the complexities of loan default predictions. It is characterized by a preprocessing pipeline (median imputation and scaling for numeric features, constant-fill imputation and one-hot encoding for JOB and REASON) feeding an ensemble of decision trees tuned via cross-validated grid search over the number of estimators, tree depth, minimum leaf size, and class weights.
Strategic Recommendations and Next Steps
Model Enhancement: Continue refining the Random Forest model through hyperparameter tuning.
Feature Expansion: Investigate additional features and interactions that may uncover deeper insights into default risks. Update the model periodically to adapt to new economic conditions or data.
Deployment and Monitoring: Implement the model within the existing loan processing infrastructure for real-time assessments and establish a system for ongoing performance monitoring and model updates.
Regulatory Compliance and Ethics: Regularly review the model for compliance with financial regulations and ethical standards, ensuring it remains free from biases and maintains fairness across different borrower demographics.
Continued Research and Development: Foster ongoing research into new data sources and predictive technologies to stay ahead of market trends and economic shifts.
This strategic approach ensures that our predictive modeling not only enhances financial decision-making but also aligns with industry best practices and regulatory standards, thereby supporting sustainable and profitable lending operations.
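The deployment-and-monitoring step recommended above can be sketched with joblib persistence: the fitted pipeline (preprocessing plus model) is saved as a single artifact that the serving layer loads to score new applications. The file name below is illustrative:

```python
# Sketch: persist and reload the fitted pipeline for serving.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, random_state=42)
pipeline = Pipeline([('model', RandomForestClassifier(random_state=42))]).fit(X, y)

joblib.dump(pipeline, 'loan_default_model.joblib')   # at training time
loaded = joblib.load('loan_default_model.joblib')    # in the serving layer
print(loaded.predict(X[:5]))
```

Keeping preprocessing inside the persisted pipeline guarantees that live applications are transformed exactly as the training data was.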
Summary of the Problem
The primary challenge faced by the financial sector, specifically in the domain of retail banking, revolves around accurately predicting loan defaults. This issue is critical as defaults significantly impact a bank’s profitability and operational efficiency. Traditionally, the loan approval process has relied heavily on manual assessment, which not only consumes substantial time and resources but is also prone to human error and bias. In an era where financial markets are rapidly evolving and consumer credit profiles are becoming increasingly complex, traditional methods have shown limitations in effectively predicting loan repayment behaviors.
Key Points of the Final Proposed Solution Design
The proposed solution involves the deployment of a Random Forest Classifier: an ensemble of decision trees trained on historical application data, wrapped in a single preprocessing-plus-model pipeline, and tuned and validated with cross-validated grid search as described above.
Reason for the Proposed Solution Design
The design of this model is motivated by the need to enhance the efficiency and accuracy of the loan approval process. Random Forest was selected due to its ability to perform well with complex datasets that feature nonlinear relationships and interactions among variables. Its ensemble nature also mitigates the risk of overfitting, making it more reliable for operational use. The model’s ability to provide feature importance rankings further aids in refining risk assessment processes and developing more targeted financial products.
Impact on the Problem/Business
Implementing this solution would have a transformative effect on the business by improving the accuracy of approval decisions, reducing default-related losses, shortening decision times through automation, and supporting fair, explainable outcomes for applicants.
In conclusion, the adoption of the Random Forest Classifier model represents a strategic move towards leveraging advanced analytics to improve financial decisions and risk management in banking operations, aligning with broader trends towards data-driven decision-making in the industry.
Key Recommendations to Implement the Solution
Integration with Existing Systems: Embed the trained pipeline into the bank's loan origination workflow so that applications are scored automatically at submission time.
Training and Validation: Retrain on recent application data on a regular schedule and validate each release against a held-out set before promoting it to production.
Continuous Monitoring and Updating: Track live accuracy, recall on defaults, and data drift, and re-tune hyperparameters when performance degrades.
Key Actionables for Stakeholders
IT and Data Science Teams: Own deployment, monitoring dashboards, retraining pipelines, and model versioning.
Risk Management Teams: Review the model's thresholds, default-recall performance, and feature importances against the bank's risk appetite and ECOA obligations.
Executive Leadership: Sponsor the rollout, set acceptable risk and fairness targets, and allocate resources for ongoing maintenance.
Key Risks and Challenges
Data drift as economic conditions change, potential proxy-variable bias against protected groups, regulatory scrutiny of automated credit decisions, and the lower interpretability of an ensemble relative to a single tree are the principal risks, each mitigated by the monitoring, fairness reviews, and compliance checks recommended above.
By addressing these recommendations, actionables, and potential risks, the implementation of the Random Forest model can be effectively managed to maximize its benefits and ensure it contributes positively to the organization's strategic objectives.
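The bias-monitoring part of the compliance recommendations can be sketched as a group-wise audit: compare the model's flagged-as-default rate and recall across demographic groups. The group labels below are entirely synthetic and hypothetical — the features used in this analysis contain no protected attributes, and a real audit requires carefully sourced, legally reviewed demographic data:

```python
# Sketch: basic per-group fairness audit with synthetic, hypothetical groups.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

rng = np.random.default_rng(0)
group = rng.choice(['A', 'B'], size=len(y_te))   # hypothetical group labels

for g in ('A', 'B'):
    mask = group == g
    flag_rate = pred[mask].mean()                # share flagged as default
    rec = recall_score(y_te[mask], pred[mask], zero_division=0)
    print(f'group {g}: flagged-as-default rate={flag_rate:.2f}, recall={rec:.2f}')
```

Large gaps between groups on either metric would trigger the bias-mitigation and legal-review steps described in the recommendations above.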